yufan_yin_week4: 6.10. - 12.10.2020
See also my course diary page: https://yufanyin.github.io/datavis-R/
Read the region_scores.csv data
library(tidyverse) # for the pipe, mutate(), ggplot2 etc. used below
region_scores <- read.csv(file = "D:/Users/yinyf/datavis-R/week4/region_scores.csv", stringsAsFactors = TRUE)
region_scores <- region_scores %>%
mutate(id = as.character(id),
region = factor(region),
education = factor(education, ordered = TRUE),
gender = factor(gender))
glimpse(region_scores)
## Rows: 240
## Columns: 6
## $ id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", ...
## $ region <fct> South Karelia, Satakunta, Kymenlaakso, South Karelia, Sou...
## $ education <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ gender <fct> M, F, M, F, F, F, M, F, M, M, F, M, F, M, M, M, F, M, F, ...
## $ age <int> 56, 41, 48, 41, 35, 60, 28, 28, 48, 51, 45, 55, 41, 24, 6...
## $ score <dbl> 4.268811, 5.646586, 6.949019, 7.096777, 6.990985, 5.26766...
Cutting the score values into intervals of width 10
region_scores %>%
mutate(score_group = cut_width(score, 10, boundary = 0)) %>%
count(score_group)
## score_group n
## 1 [0,10] 55
## 2 (10,20] 154
## 3 (20,30] 31
region_scores <- region_scores %>%
mutate(score_group = cut_width(score, 10, boundary = 0,
labels = c('-10','11-20','21-')))
region_scores %>%
distinct(score_group)
## score_group
## 1 -10
## 2 11-20
## 3 21-
Note: at one point I got the error "Column `score_group` is not found".
region_scores2 <- region_scores %>%
group_by(education, score_group, .drop = FALSE) %>%
summarise(mean_age = mean(age),
sd_age = sd(age),
n = n()) %>%
ungroup()
## `summarise()` regrouping output by 'education' (override with `.groups` argument)
region_scores2
## # A tibble: 9 x 5
## education score_group mean_age sd_age n
## <fct> <fct> <dbl> <dbl> <int>
## 1 1 -10 39.5 10.1 46
## 2 1 11-20 38.8 10.2 39
## 3 1 21- NaN NA 0
## 4 2 -10 45 9.27 9
## 5 2 11-20 42.2 9.61 65
## 6 2 21- 39.3 7.57 3
## 7 3 -10 NaN NA 0
## 8 3 11-20 40.1 10.4 50
## 9 3 21- 37.4 8.97 28
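The rows with n = 0 and mean_age = NaN are expected: `.drop = FALSE` keeps every combination of factor levels even when no observations fall into it, and the mean of an empty vector is NaN (sd() similarly gives NA). A minimal check:

```r
# mean() of a zero-length vector is NaN, which is what fills the empty groups above
mean(numeric(0))  # NaN
```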
Create a figure that shows the distributions (density plots or histograms) of age and score in separate subplots (facets). What do you need to do first?
Note: I’m not sure which grouping variable to use to create the subplots.
In the figure, set individual x-axis limits for age and score by modifying the scales parameter within facet_wrap().
Question: What went wrong when I used facet_wrap() and saw the warning ‘Layer 1 is missing score_group (or another grouping variable)’? I ran into this last week too, even though I had saved score_group.
region_scores %>%
ggplot(aes(age, fill = score_group)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
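For the faceting task above: what needs to happen first is gathering age and score into one value column, so that facet_wrap() can split on the variable name; `scales = "free_x"` then gives each facet its own x-axis limits. A sketch — the region_scores tibble below is a hypothetical stand-in with the same two numeric columns, since the real CSV is not available here:

```r
library(tidyverse)

# Hypothetical stand-in for region_scores (only age and score matter here)
region_scores <- tibble(age   = sample(20:65, 240, replace = TRUE),
                        score = runif(240, 0, 30))

# Gather the two measures into long format: one column names the variable,
# another holds its value
scores_long <- region_scores %>%
  pivot_longer(c(age, score), names_to = "variable", values_to = "value")

# One facet per variable, each with its own x-axis limits
p <- ggplot(scores_long, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ variable, scales = "free_x")
```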
(A few more attempts, as a reminder for the future.)
region_scores %>%
ggplot(aes(age, fill = gender)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
region_scores %>%
ggplot(aes(score, fill = gender)) +
geom_histogram(position = "identity", alpha = .5, binwidth = 1)
Note: I do not understand the meaning of the y-axis in such density plots.
region_scores %>%
ggplot(aes(age, fill = gender)) +
geom_density(alpha = .5)
region_scores %>%
ggplot(aes(score, fill = gender)) +
geom_density(alpha = .5)
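On the y-axis question above: geom_density() draws a kernel density estimate, so the y-axis is probability density, not a count — it is scaled so that the total area under each curve is 1. A quick numeric check with base R's density():

```r
# Kernel density estimate of a built-in numeric variable
d <- density(iris$Sepal.Length)

# Approximate the area under the curve by the trapezoidal rule: it should be ~1
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
round(area, 2)
```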
In this exercise, you will use the built-in iris dataset.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Make the data into long format: gather all variables except species into new variables var (variable names) and measure (numerical values). You should end up with 600 rows and 3 columns (Species, var, and measure). Assign the result into iris_long.
iris_long <- iris %>%
gather(var, measure, -Species)
str(iris_long)
## 'data.frame': 600 obs. of 3 variables:
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var : chr "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
## $ measure: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
In iris_long, separate var into two variables: part (Sepal/Petal values) and dim (Length/Width).
Then, spread the measurement values to new columns that get their names from dim. You must create row numbers by dim group before doing this.
You should now have 300 rows of variables Species, part, Length and Width (and row numbers). Assign the result into iris_wide.
Note: This was a bit more complex than the example. I tried many times but failed, so I kept some of the code in the following chunk.
iris_long %>%
group_by(Species) %>%
mutate(row = row_number()) %>%
ungroup %>%
spread(?, ?) %>%
select(-row)
However,
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `data.frame<Sepal.Width:double>`.
i It must be numeric or character.
Or:
iris_long %>%
pivot_wider(names_from = c(var),
values_from = measure)
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <list> <list> <list> <list>
## 1 setosa <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
## 2 versicolor <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
## 3 virginica <dbl [50]> <dbl [50]> <dbl [50]> <dbl [50]>
There is still an error.
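For the record, here is one working version of the reshape in the gather/spread style used above: separate var into part and dim, add row numbers within each dim group so spread() has a unique key, then spread dim into Length and Width columns.

```r
library(tidyverse)

iris_long <- iris %>%
  gather(var, measure, -Species)            # 600 rows: Species, var, measure

iris_wide <- iris_long %>%
  separate(var, into = c("part", "dim"), sep = "\\.") %>%  # Sepal/Petal and Length/Width
  group_by(dim) %>%
  mutate(row = row_number()) %>%            # unique key within each dim group
  ungroup() %>%
  spread(dim, measure)                      # 300 rows: Species, part, row, Length, Width
```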
Using iris_wide, plot a scatter plot of length on the x-axis and width on the y-axis. Colour the points by part.
iris_wide %>%
  ggplot(aes(Length, Width, color = part)) + # x = Length, y = Width, coloured by part
  geom_point()
Import your data into R. Check that you have the correct number of rows and columns, column names are in place, the encoding of characters looks OK, etc.
learning2019_w4 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week4.csv", stringsAsFactors = TRUE)
Print the structure/glimpse/summary of the data. Outline briefly what kind of variables you have and if there are any missing or abnormal values. Make sure that each variable has the right class (numeric/character/factor etc).
learning_w4 <- learning2019_w4 %>%
mutate(studylength = as.numeric(studylength),
writingcourse = as.numeric(writingcourse))
str(learning_w4)
## 'data.frame': 206 obs. of 10 variables:
## $ 锘縞luster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 2.67 1 2.33 3 3.67 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 2 ...
## $ studylength : num 39 51 3 3 15 3 3 3 3 3 ...
## $ writingcourse : num 2 3 4 0 0 11 0 0 44 35 ...
Pick a few (2-5) variables of interest from your data (ideally, both categorical and numerical).
For categorical variables, count the observations in each category (or combination of categories). Are the frequencies balanced?
learning19_w4 %>%
count(cluster, gender) %>%
arrange(desc(n)) %>%
arrange(cluster)
Error: Must group by variables found in .data. * Column cluster is not found. Neither is learning19_w4[1]. Well… I’m not very angry.
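The garbled first column name (`锘縞luster` in the str() output above) is the likely culprit: the CSV starts with a UTF-8 byte-order mark (BOM) that read.csv() glues onto the first header, so the column is not literally named `cluster`. (The chunk above also refers to `learning19_w4`, while the object was saved as `learning_w4`.) A self-contained reproduction with a temporary file, plus the fileEncoding fix:

```r
# Write a tiny CSV whose first three bytes are the UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("cluster,gender\n1,2\n")), tmp)

bad  <- read.csv(tmp)                              # BOM gets glued onto "cluster"
good <- read.csv(tmp, fileEncoding = "UTF-8-BOM")  # BOM is stripped

names(bad)[1]   # something other than "cluster"
names(good)[1]  # "cluster"
```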
For numerical variables, compute some summary statistics (e.g. min, max, mean, median, SD) over the whole dataset or for subgroups. What can you say about the distributions of these variables, or possible group-wise differences?
Overall:
summary(learning_w4)
## 锘縞luster unref deep orga
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.670 1st Qu.:3.750 1st Qu.:2.670
## Median :2.000 Median :2.000 Median :4.000 Median :3.330
## Mean :1.718 Mean :2.178 Mean :4.007 Mean :3.411
## 3rd Qu.:2.000 3rd Qu.:2.670 3rd Qu.:4.500 3rd Qu.:4.000
## Max. :3.000 Max. :5.000 Max. :5.000 Max. :5.000
## blocks procrastination gender studentstatus
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:1.000 1st Qu.:2.000
## Median :2.670 Median :3.250 Median :2.000 Median :2.000
## Mean :2.655 Mean :3.212 Mean :1.714 Mean :1.767
## 3rd Qu.:3.330 3rd Qu.:3.750 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :2.000 Max. :2.000
## studylength writingcourse
## Min. : 2.00 Min. : 0.000
## 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 14.00 Median : 3.000
## Mean : 19.75 Mean : 6.694
## 3rd Qu.: 28.00 3rd Qu.: 6.000
## Max. :172.00 Max. :91.000
For subgroups:
**Note:** I do not believe the mean values of subgroups divided by gender or student status (Bachelor/Master) could be equal. What’s wrong?
grouped_df <- learning_w4 %>%
group_by(studentstatus)
grouped_df %>%
summarise(unref_mean = mean(learning_w4$unref), deep_mean = mean(learning_w4$deep), orga_mean = mean(learning_w4$deep))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
## studentstatus unref_mean deep_mean orga_mean
## <int> <dbl> <dbl> <dbl>
## 1 1 2.18 4.01 4.01
## 2 2 2.18 4.01 4.01
We can see that studylength (how many months students have studied at the university) is a better grouping variable than the number of writing courses. But …
Try cluster (student profile based on the combination of scores on ‘unref’, ‘deep’ and ‘orga’)
learning_w4 %>%
count(learning_w4[1])
## 锘縞luster n
## 1 1 94
## 2 2 76
## 3 3 36
grouped_learning <- learning_w4 %>%
group_by(learning_w4[1])
grouped_learning %>%
summarise(unref_mean = mean(grouped_learning$unref), deep_mean = mean(grouped_learning$deep), orga_mean = mean(grouped_learning$orga))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## 锘縞luster unref_mean deep_mean orga_mean
## <int> <dbl> <dbl> <dbl>
## 1 1 2.18 4.01 3.41
## 2 2 2.18 4.01 3.41
## 3 3 2.18 4.01 3.41
# the results look strange but I do not know what went wrong
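What went wrong in both grouped summaries above: inside summarise(), columns must be referenced by bare name. Writing mean(learning_w4$unref) (or mean(grouped_learning$unref)) extracts the full, ungrouped vector, so every group receives the identical overall mean — exactly the pattern in the two outputs. (Note also that orga_mean in the first summary was accidentally computed from deep.) A toy demonstration:

```r
library(dplyr)

df <- tibble(g = c(1, 1, 2, 2), x = c(1, 2, 3, 4))

# Wrong: df$x ignores the grouping, so both groups get the overall mean 2.5
wrong <- df %>% group_by(g) %>% summarise(m = mean(df$x))

# Right: a bare column name is evaluated within each group
right <- df %>% group_by(g) %>% summarise(m = mean(x))

wrong$m  # 2.5 2.5
right$m  # 1.5 3.5
```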
Describe if there’s anything else you think should be done as “pre-processing” steps (e.g. recoding/grouping values, renaming variables, removing variables or mutating new ones, reshaping the data to long format, merging data frames together).
Do you have an idea of what kind of relationships in your data you would like to visualise and for which variables? For example, would you like to depict variable distributions, the structure of multilevel data, summary statistics (e.g. means), or include model fits or predictions?
Structure of the data
learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE)
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame': 211 obs. of 13 variables:
## $ 锘縞luster : int 3 2 1 1 3 1 2 2 1 3 ...
## $ unref : num 4 2 3 2 3 ...
## $ deep : num 3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
## $ orga : num 3.33 3 4.33 3.67 2.67 ...
## $ blocks : num 3.33 3.67 3.67 3 3.67 ...
## $ procrastination: num 3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
## $ perfectionism : num 3.67 3.33 3.33 2.67 2.33 ...
## $ innateability : num 1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
## $ ktransforming : num 4 3.67 3.67 3.33 4 ...
## $ productivity : num 1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
## $ gender : int 2 2 2 2 2 2 2 2 2 2 ...
## $ studentstatus : int 1 1 1 1 1 1 1 1 1 3 ...
## $ studylength : int 39 51 3 3 15 3 3 3 3 3 ...
The aim of the study is to investigate the interrelationships between the approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups based on their Z scores on the three approaches to learning. Then we compare mean differences and ANOVA results between the profiles.
The data ‘learning2019’ consists of 218 observations and 17 variables. It contains students’ scores on approaches to learning (different ways that students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, age, faculty, student status and study length).
Explanations of some columns follow. Each is the average of 2-4 questions on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).
“unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.
“deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.
“orga”: time management, study organisation, effort management and concentration.
“blocks”: the inability to write productively that is not due to a lack of intellectual capacity or literary skill.
“procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.
“perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.
“innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.
“ktransforming”: (knowledge transforming) using writing to develop knowledge and generate new ideas in reflective and dialectic processes.
“productivity”: (sense of productivity) part of self-efficacy in writing.
summary(learning19)
## 锘縞luster unref deep orga
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.667 1st Qu.:3.750 1st Qu.:2.667
## Median :2.000 Median :2.000 Median :4.000 Median :3.333
## Mean :1.716 Mean :2.171 Mean :4.007 Mean :3.414
## 3rd Qu.:2.000 3rd Qu.:2.667 3rd Qu.:4.500 3rd Qu.:4.000
## Max. :3.000 Max. :5.000 Max. :5.000 Max. :5.000
## blocks procrastination perfectionism innateability
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:2.000 1st Qu.:1.000
## Median :2.667 Median :3.250 Median :2.333 Median :1.500
## Mean :2.662 Mean :3.219 Mean :2.556 Mean :1.761
## 3rd Qu.:3.333 3rd Qu.:3.875 3rd Qu.:3.333 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## ktransforming productivity gender studentstatus
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.667 1st Qu.:1.875 1st Qu.:1.000 1st Qu.:3.000
## Median :4.000 Median :2.500 Median :2.000 Median :4.000
## Mean :4.041 Mean :2.487 Mean :1.716 Mean :3.185
## 3rd Qu.:4.667 3rd Qu.:3.250 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :5.000 Max. :4.750 Max. :2.000 Max. :4.000
## studylength
## Min. : 2.00
## 1st Qu.: 5.00
## Median : 14.00
## Mean : 21.63
## 3rd Qu.: 28.50
## Max. :172.00
Calculate and print the correlation matrix
cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
## unref deep orga blocks procrastination perfectionism
## unref 1.00 -0.48 -0.31 0.33 0.25 0.28
## deep -0.48 1.00 0.32 -0.27 -0.18 -0.19
## orga -0.31 0.32 1.00 -0.22 -0.38 -0.14
## blocks 0.33 -0.27 -0.22 1.00 0.55 0.54
## procrastination 0.25 -0.18 -0.38 0.55 1.00 0.35
## perfectionism 0.28 -0.19 -0.14 0.54 0.35 1.00
## innateability 0.16 -0.11 -0.02 0.24 0.13 0.28
## ktransforming -0.16 0.31 0.16 -0.30 -0.21 -0.25
## productivity -0.15 0.16 0.30 -0.38 -0.46 -0.22
## innateability ktransforming productivity
## unref 0.16 -0.16 -0.15
## deep -0.11 0.31 0.16
## orga -0.02 0.16 0.30
## blocks 0.24 -0.30 -0.38
## procrastination 0.13 -0.21 -0.46
## perfectionism 0.28 -0.25 -0.22
## innateability 1.00 -0.25 0.01
## ktransforming -0.25 1.00 0.21
## productivity 0.01 0.21 1.00
Mark the correlations by significance level and visualise the correlation matrix.
library(corrplot)
## corrplot 0.84 loaded
p.mat <- cor.mtest(learning19[2:10])$p # p-values must be computed from the raw data, not from the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.6,
         p.mat = p.mat, sig.level = 0.01,
         title = "Correlations of learning19", mar = c(0, 0, 1, 0))
learning19 %>%
  ggplot(aes(orga, procrastination, color = factor(cluster))) + # colour by cluster, as a factor since clusters are categories
  geom_point()
Euclidean distance matrix
learning19_eu <- dist(learning19[2:4])
summary(learning19_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.083 1.601 1.741 2.192 6.741
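As a reminder of what dist() returns: the lower triangle of the matrix of pairwise Euclidean distances between rows (here over the three learning-approach columns). A minimal example with two points:

```r
# Two points in the plane: (0, 0) and (3, 4); their Euclidean distance is 5
m <- rbind(c(0, 0), c(3, 4))
d <- dist(m)   # "dist" object holding the lower triangle: a single value here
as.numeric(d)  # 5
```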
set.seed(123)
k_max <- 5 # determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(learning19[2:4], k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results
The TWCSS value decreases heavily from 2 to 5 clusters. The optimal number of clusters was 3.
learning19_km <- kmeans(learning19[2:10], centers = 3)
Plot the dataset with clusters
pairs(learning19[2:10], col = learning19_km$cluster)
pairs(learning19[,2:4], col = learning19_km$cluster)
pairs(learning19[,5:10], col = learning19_km$cluster)
The optimal number of clusters was 3: with three clusters we got the best overview.
library(devtools)
library(flipMultivariates)
library(MASS) # lda() itself comes from MASS
learning19_scaled3 <- scale(learning19[2:4])
learning19_km3 <-kmeans(learning19_scaled3, centers = 3)
cluster <- learning19_km3$cluster
learning19_scaled3 <- data.frame(learning19_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = learning19_scaled3)
lda.fit_cluster
Warning in install.packages: package ‘flipMultivariates’ is not available — but I was able to run it before, so I kept the code.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(learning19_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)
model_predictors <- dplyr::select(learning19_train, -deep2) # note: learning19_train and lda.fit come from a train/test split not shown here
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Next, install and access the plotly package.
Create a 3D plot of the columns of the matrix product.
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = learning19_train$deep2)
library(plot3D)
scatter3D(x = learning19$unref, y = learning19$deep, z = learning19$orga, col = NULL,
          main = "learning19 data", xlab = "unref",
          ylab = "deep", zlab = "orga") # axis labels now match the x/y/z variables
library(plotly)
plot_ly(x = learning19$unref, y = learning19$deep, z = learning19$orga,
        type = 'scatter3d', mode = 'markers', color = learning19$deep)